{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 用pandas将数据划分区间\n",
    "在数据建模的过程中，经常会遇到：年龄，收入，价格以及类似的数据，  \n",
    "对于这样的数据，经常会将这些数据放到指定的区间中，再将区间进行不同的编码，  \n",
    "编码后的数据，在进行建模。  \n",
    "在pandas中可以使用pandas.cut()方法实现对数据的区间划分。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 案例数据\n",
    "- 以姓名、年龄为例，使用pandas.cut()方法对age进行切分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "df = pd.DataFrame(data={\n",
    "    \"name\":[\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"I\",\"J\"],\n",
    "    \"age\":[23,26,37,46,85,12,53,80,66,32],\n",
    "    \"score\":[13,23,22,76,56,89,99,100,10,54],\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "  name  age  score\n0    A   23     13\n1    B   26     23\n2    C   37     22\n3    D   46     76\n4    E   85     56\n5    F   12     89\n6    G   53     99\n7    H   80    100\n8    I   66     10\n9    J   32     54",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>name</th>\n      <th>age</th>\n      <th>score</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>A</td>\n      <td>23</td>\n      <td>13</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>B</td>\n      <td>26</td>\n      <td>23</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>C</td>\n      <td>37</td>\n      <td>22</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>D</td>\n      <td>46</td>\n      <td>76</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>E</td>\n      <td>85</td>\n      <td>56</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>F</td>\n      <td>12</td>\n      <td>89</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>G</td>\n      <td>53</td>\n      <td>99</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>H</td>\n      <td>80</td>\n      <td>100</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>I</td>\n      <td>66</td>\n      <td>10</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>J</td>\n      <td>32</td>\n      <td>54</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  pandas.cut()介绍\n",
    "用来将数据划分为不同的区间  \n",
    "\n",
    "- x：array型数据（DataFrame的每一列数据都是array型数据）\n",
    "- bins：传入int型数据，表示划分的区间个数，传入liest型数据，表示自定义的区间\n",
    "- labels：与bins对应区间的标签（默认为None）\n",
    "- retbins：True表示返回划分的区间，False表示不返回划分的区间（默认为False）\n",
    "- right：True表示左开右闭，False表示左闭右开（默认为True）\n",
    "\n",
    "返回数据：\n",
    "- x对应所在的区间，array类型\n",
    "- retbins为True时，会返回划分区间"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一、自动划分区间\n",
    "eg：bins=3,right=True,pandas会将数据划分为3个区间,划分方法,  \n",
    "(max-(max-min)/bins,max]==>(60.667,85]  \n",
    "(max-(max-min)/bins\\*2,max-(max-min)/bins]==>(36.333.60.667]  \n",
    "(max-(max-min)/bins\\*3,max-(max-min)/bins\\*2]==>(11.927, 36.333] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "a,b = pd.cut(x=df[\"age\"],bins=3,right=True,retbins=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "0    (11.927, 36.333]\n1    (11.927, 36.333]\n2    (36.333, 60.667]\n3    (36.333, 60.667]\n4      (60.667, 85.0]\n5    (11.927, 36.333]\n6    (36.333, 60.667]\n7      (60.667, 85.0]\n8      (60.667, 85.0]\n9    (11.927, 36.333]\nName: age, dtype: category\nCategories (3, interval[float64]): [(11.927, 36.333] < (36.333, 60.667] < (60.667, 85.0]]"
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "array([11.927     , 36.33333333, 60.66666667, 85.        ])"
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二、自定义划分区间\n",
    "\n",
    "eg：自定义一个年龄分段列表，age_bins = \\[10,20,30,50,70,80,90\\]  \n",
    "\n",
    "对应的区间为：\\[(10, 20\\] < (20, 30] < (30, 50] < (50, 70] < (70, 80] < (80, 90]] \n",
    "\n",
    "这样pandas会按照age_bins指定的区间进行划分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "age_bins = [10,20,30,50,70,80,90]\n",
    "a,b = pd.cut(x=df[\"age\"],bins=age_bins,right=True,retbins=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "0    (20, 30]\n1    (20, 30]\n2    (30, 50]\n3    (30, 50]\n4    (80, 90]\n5    (10, 20]\n6    (50, 70]\n7    (70, 80]\n8    (50, 70]\n9    (30, 50]\nName: age, dtype: category\nCategories (6, interval[int64]): [(10, 20] < (20, 30] < (30, 50] < (50, 70] < (70, 80] < (80, 90]]"
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "array([10, 20, 30, 50, 70, 80, 90])"
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 三、区间左边是否包含\n",
    "使用场景：当age为80的时候，应该归为(70,80]还是\\[80,90)，这是个问题  \n",
    "\n",
    "eg：bins = \\[10,20,30,50,70,80,90\\]，right=True\n",
    "\n",
    "对应的区间为：\\[(10, 20\\] < (20, 30] < (30, 50] < (50, 70] < (70, 80] < (80, 90]]  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "(0    (20, 30]\n 1    (20, 30]\n 2    (30, 50]\n 3    (30, 50]\n 4    (80, 90]\n 5    (10, 20]\n 6    (50, 70]\n 7    (70, 80]\n 8    (50, 70]\n 9    (30, 50]\n Name: age, dtype: category\n Categories (6, interval[int64]): [(10, 20] < (20, 30] < (30, 50] < (50, 70] < (70, 80] < (80, 90]],\n array([10, 20, 30, 50, 70, 80, 90]))"
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.cut(x=df.age,bins=age_bins,retbins=True,right=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "bins = \\[10,20,30,50,70,80,90\\]，right=False\n",
    "\n",
    "对应的区间为：\\[[10, 20) < [20, 30) < [30, 50) < [50, 70) < [70, 80) < [80, 90)]  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "(0    [20, 30)\n 1    [20, 30)\n 2    [30, 50)\n 3    [30, 50)\n 4    [80, 90)\n 5    [10, 20)\n 6    [50, 70)\n 7    [80, 90)\n 8    [50, 70)\n 9    [30, 50)\n Name: age, dtype: category\n Categories (6, interval[int64]): [[10, 20) < [20, 30) < [30, 50) < [50, 70) < [70, 80) < [80, 90)],\n array([10, 20, 30, 50, 70, 80, 90]))"
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.cut(x=df.age,bins=age_bins,retbins=True,right=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 四、区间加上标签\n",
    "使用labels参数可以对区间加上标签，例如score列，小于60为不及格，60-80良好，80以上优秀  \n",
    "\n",
    "eg:bins:\\[0,60,80,100\\],labels：\\[\"不及格\",\"良好\",\"优秀\"\\]  \n",
    "\n",
    "返回的是对应的标签，而不是对应的区间"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "0    不及格\n1    不及格\n2    不及格\n3     良好\n4    不及格\n5     优秀\n6     优秀\n7     优秀\n8    不及格\n9    不及格\nName: score, dtype: category\nCategories (3, object): [不及格 < 良好 < 优秀]"
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.cut(x=df.score,bins=[0,60,80,100],labels=[\"不及格\",\"良好\",\"优秀\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 给数据加上区间和标签"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"age_range\"] = pd.cut(x=df[\"age\"],bins=[10, 20, 30, 50, 70, 80, 90])\n",
    "df[\"score_label\"] = pd.cut(x=df[\"score\"],bins=[0,60,80,100],labels=[\"不及格\",\"良好\",\"优秀\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "  name  age  score age_range score_label\n0    A   23     13  (20, 30]         不及格\n1    B   26     23  (20, 30]         不及格\n2    C   37     22  (30, 50]         不及格\n3    D   46     76  (30, 50]          良好\n4    E   85     56  (80, 90]         不及格\n5    F   12     89  (10, 20]          优秀\n6    G   53     99  (50, 70]          优秀\n7    H   80    100  (70, 80]          优秀\n8    I   66     10  (50, 70]         不及格\n9    J   32     54  (30, 50]         不及格",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>name</th>\n      <th>age</th>\n      <th>score</th>\n      <th>age_range</th>\n      <th>score_label</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>A</td>\n      <td>23</td>\n      <td>13</td>\n      <td>(20, 30]</td>\n      <td>不及格</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>B</td>\n      <td>26</td>\n      <td>23</td>\n      <td>(20, 30]</td>\n      <td>不及格</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>C</td>\n      <td>37</td>\n      <td>22</td>\n      <td>(30, 50]</td>\n      <td>不及格</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>D</td>\n      <td>46</td>\n      <td>76</td>\n      <td>(30, 50]</td>\n      <td>良好</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>E</td>\n      <td>85</td>\n      <td>56</td>\n      <td>(80, 90]</td>\n      <td>不及格</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>F</td>\n      <td>12</td>\n      <td>89</td>\n      <td>(10, 20]</td>\n      <td>优秀</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>G</td>\n      <td>53</td>\n      <td>99</td>\n      <td>(50, 70]</td>\n      <td>优秀</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>H</td>\n      <td>80</td>\n      <td>100</td>\n      <td>(70, 80]</td>\n      <td>优秀</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>I</td>\n      <td>66</td>\n      <td>10</td>\n      <td>(50, 70]</td>\n      <td>不及格</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>J</td>\n      <td>32</td>\n      <td>54</td>\n      <td>(30, 50]</td>\n      <td>不及格</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 简单的分析一下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "(30, 50]    3\n(50, 70]    2\n(20, 30]    2\n(80, 90]    1\n(70, 80]    1\n(10, 20]    1\nName: age_range, dtype: int64"
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"age_range\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "score_label  不及格  良好  优秀\nage_range               \n(10, 20]       0   0   1\n(20, 30]       2   0   0\n(30, 50]       2   1   0\n(50, 70]       1   0   1\n(70, 80]       0   0   1\n(80, 90]       1   0   0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th>score_label</th>\n      <th>不及格</th>\n      <th>良好</th>\n      <th>优秀</th>\n    </tr>\n    <tr>\n      <th>age_range</th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>(10, 20]</th>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>(20, 30]</th>\n      <td>2</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>(30, 50]</th>\n      <td>2</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>(50, 70]</th>\n      <td>1</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>(70, 80]</th>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>(80, 90]</th>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.pivot_table(data=df,index=\"age_range\",columns=\"score_label\",values=\"name\",aggfunc=lambda x:len(x),fill_value=0)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}